{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Temporal-Difference Methods\n",
    "\n",
    "In this notebook, you will write your own implementations of many Temporal-Difference (TD) methods.\n",
    "\n",
    "While we have provided some starter code, you are welcome to erase these hints and write your code from scratch.\n",
    "\n",
    "---\n",
    "\n",
    "### Part 0: Explore CliffWalkingEnv\n",
    "\n",
    "We begin by importing the necessary packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import gym\n",
    "import numpy as np\n",
    "from collections import defaultdict, deque\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "import check_test\n",
    "from plot_utils import plot_values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the code cell below to create an instance of the [CliffWalking](https://github.com/openai/gym/blob/master/gym/envs/toy_text/cliffwalking.py) environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('CliffWalking-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The agent moves through a $4\\times 12$ gridworld, with states numbered as follows:\n",
    "```\n",
    "[[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],\n",
    " [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],\n",
    " [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],\n",
    " [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]\n",
    "```\n",
    "At the start of any episode, state `36` is the initial state.  State `47` is the only terminal state, and the cliff corresponds to states `37` through `46`.\n",
    "\n",
    "The agent has 4 potential actions:\n",
    "```\n",
    "UP = 0\n",
    "RIGHT = 1\n",
    "DOWN = 2\n",
    "LEFT = 3\n",
    "```\n",
    "\n",
    "Thus, $\\mathcal{S}^+=\\{0, 1, \\ldots, 47\\}$, and $\\mathcal{A} =\\{0, 1, 2, 3\\}$.  Verify this by running the code cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(env.action_space)\n",
    "print(env.observation_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this mini-project, we will build towards finding the optimal policy for the CliffWalking environment.  The optimal state-value function is visualized below.  Please take the time now to make sure that you understand _why_ this is the optimal state-value function.\n",
    "\n",
    "_**Note**: You can safely ignore the values of the cliff \"states\" as these are not true states from which the agent can make decisions.  For the cliff \"states\", the state-value function is not well-defined._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define the optimal state-value function\n",
    "V_opt = np.zeros((4,12))\n",
    "V_opt[0:13][0] = -np.arange(3, 15)[::-1]\n",
    "V_opt[0:13][1] = -np.arange(3, 15)[::-1] + 1\n",
    "V_opt[0:13][2] = -np.arange(3, 15)[::-1] + 2\n",
    "V_opt[3][0] = -13\n",
    "\n",
    "plot_values(V_opt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 1: TD Control: Sarsa\n",
    "\n",
    "In this section, you will write your own implementation of the Sarsa control algorithm.\n",
    "\n",
    "Your algorithm has four arguments:\n",
    "- `env`: This is an instance of an OpenAI Gym environment.\n",
    "- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
    "- `alpha`: This is the step-size parameter for the update step.\n",
    "- `gamma`: This is the discount rate.  It must be a value between 0 and 1, inclusive (default value: `1`).\n",
    "\n",
    "The algorithm returns as output:\n",
    "- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
    "\n",
    "Please complete the function in the code cell below.\n",
    "\n",
    "(_Feel free to define additional functions to help you to organize your code._)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def sarsa(env, num_episodes, alpha, gamma=1.0):\n",
    "    # initialize action-value function (empty dictionary of arrays)\n",
    "    Q = defaultdict(lambda: np.zeros(env.nA))\n",
    "    # initialize performance monitor\n",
    "    # loop over episodes\n",
    "    for i_episode in range(1, num_episodes+1):\n",
    "        # monitor progress\n",
    "        if i_episode % 100 == 0:\n",
    "            print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
    "            sys.stdout.flush()   \n",
    "        \n",
    "        ## TODO: complete the function\n",
    "        \n",
    "    return Q"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function.  \n",
    "\n",
    "If the code cell returns **PASSED**, then you have implemented the function correctly!  Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function.  However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# obtain the estimated optimal policy and corresponding action-value function\n",
    "Q_sarsa = sarsa(env, 5000, .01)\n",
    "\n",
    "# print the estimated optimal policy\n",
    "policy_sarsa = np.array([np.argmax(Q_sarsa[key]) if key in Q_sarsa else -1 for key in np.arange(48)]).reshape(4,12)\n",
    "check_test.run_check('td_control_check', policy_sarsa)\n",
    "print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
    "print(policy_sarsa)\n",
    "\n",
    "# plot the estimated optimal state-value function\n",
    "V_sarsa = ([np.max(Q_sarsa[key]) if key in Q_sarsa else 0 for key in np.arange(48)])\n",
    "plot_values(V_sarsa)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 2: TD Control: Q-learning\n",
    "\n",
    "In this section, you will write your own implementation of the Q-learning control algorithm.\n",
    "\n",
    "Your algorithm has four arguments:\n",
    "- `env`: This is an instance of an OpenAI Gym environment.\n",
    "- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
    "- `alpha`: This is the step-size parameter for the update step.\n",
    "- `gamma`: This is the discount rate.  It must be a value between 0 and 1, inclusive (default value: `1`).\n",
    "\n",
    "The algorithm returns as output:\n",
    "- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
    "\n",
    "Please complete the function in the code cell below.\n",
    "\n",
    "(_Feel free to define additional functions to help you to organize your code._)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def q_learning(env, num_episodes, alpha, gamma=1.0):\n",
    "    # initialize empty dictionary of arrays\n",
    "    Q = defaultdict(lambda: np.zeros(env.nA))\n",
    "    # loop over episodes\n",
    "    for i_episode in range(1, num_episodes+1):\n",
    "        # monitor progress\n",
    "        if i_episode % 100 == 0:\n",
    "            print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
    "            sys.stdout.flush()\n",
    "        \n",
    "        ## TODO: complete the function\n",
    "        \n",
    "    return Q"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function. \n",
    "\n",
    "If the code cell returns **PASSED**, then you have implemented the function correctly!  Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function.  However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# obtain the estimated optimal policy and corresponding action-value function\n",
    "Q_sarsamax = q_learning(env, 5000, .01)\n",
    "\n",
    "# print the estimated optimal policy\n",
    "policy_sarsamax = np.array([np.argmax(Q_sarsamax[key]) if key in Q_sarsamax else -1 for key in np.arange(48)]).reshape((4,12))\n",
    "check_test.run_check('td_control_check', policy_sarsamax)\n",
    "print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
    "print(policy_sarsamax)\n",
    "\n",
    "# plot the estimated optimal state-value function\n",
    "plot_values([np.max(Q_sarsamax[key]) if key in Q_sarsamax else 0 for key in np.arange(48)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 3: TD Control: Expected Sarsa\n",
    "\n",
    "In this section, you will write your own implementation of the Expected Sarsa control algorithm.\n",
    "\n",
    "Your algorithm has four arguments:\n",
    "- `env`: This is an instance of an OpenAI Gym environment.\n",
    "- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
    "- `alpha`: This is the step-size parameter for the update step.\n",
    "- `gamma`: This is the discount rate.  It must be a value between 0 and 1, inclusive (default value: `1`).\n",
    "\n",
    "The algorithm returns as output:\n",
    "- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
    "\n",
    "Please complete the function in the code cell below.\n",
    "\n",
    "(_Feel free to define additional functions to help you to organize your code._)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def expected_sarsa(env, num_episodes, alpha, gamma=1.0):\n",
    "    # initialize empty dictionary of arrays\n",
    "    Q = defaultdict(lambda: np.zeros(env.nA))\n",
    "    # loop over episodes\n",
    "    for i_episode in range(1, num_episodes+1):\n",
    "        # monitor progress\n",
    "        if i_episode % 100 == 0:\n",
    "            print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
    "            sys.stdout.flush()\n",
    "        \n",
    "        ## TODO: complete the function\n",
    "        \n",
    "    return Q"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function.  \n",
    "\n",
    "If the code cell returns **PASSED**, then you have implemented the function correctly!  Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function.  However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# obtain the estimated optimal policy and corresponding action-value function\n",
    "Q_expsarsa = expected_sarsa(env, 10000, 1)\n",
    "\n",
    "# print the estimated optimal policy\n",
    "policy_expsarsa = np.array([np.argmax(Q_expsarsa[key]) if key in Q_expsarsa else -1 for key in np.arange(48)]).reshape(4,12)\n",
    "check_test.run_check('td_control_check', policy_expsarsa)\n",
    "print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
    "print(policy_expsarsa)\n",
    "\n",
    "# plot the estimated optimal state-value function\n",
    "plot_values([np.max(Q_expsarsa[key]) if key in Q_expsarsa else 0 for key in np.arange(48)])"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
